We will begin by running a simple linear model that regresses weekly
sales onto Consumer Price Index (CPI)
# Specifying our model type and setting the computational engine
linear_model <-
linear_reg() %>%
set_engine("lm")
# Fitting the model
fit_cpi <-
linear_model %>%
fit(weekly_sales ~ cpi, data = dfw)
# Model output
summary(fit_cpi$fit)
Call:
stats::lm(formula = weekly_sales ~ cpi, data = data)
Residuals:
Min 1Q Median 3Q Max
-662386 -318443 -73868 258442 2095880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 827280.5 21778.4 37.986 < 2e-16 ***
cpi -732.7 123.7 -5.923 3.33e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 390600 on 6433 degrees of freedom
Multiple R-squared: 0.005423, Adjusted R-squared: 0.005269
F-statistic: 35.08 on 1 and 6433 DF, p-value: 3.332e-09
In this model, the intercept implies expected weekly sales of
~$827,280 for a Walmart store facing a theoretical CPI of 0, which is an
extrapolation well outside the observed data. We also observe that the
relationship between Weekly_Sales and CPI is negative. That is, a
one-unit increase in CPI is associated with a decrease of ~$733 in
weekly sales, and a one-unit decrease in CPI with an increase of ~$733.
In evaluating the model statistics, we see an Adjusted R-squared
value of 0.005269. In other words, this model explains only roughly 0.5%
of the variance in Walmart’s weekly sales. So, while the estimated
effect of CPI on Weekly_Sales is statistically significant, we must
conclude that this model fails to explain the variance in
our target variable.
# Now we will plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.
plot_store_10 <-
dfw %>%
filter(store == 10) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_10)
`geom_smooth()` using formula = 'y ~ x'
plot_store_11 <-
dfw %>%
filter(store == 11) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_11)
`geom_smooth()` using formula = 'y ~ x'
plot_store_12 <-
dfw %>%
filter(store == 12) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_12)
`geom_smooth()` using formula = 'y ~ x'
plot_store_13 <-
dfw %>%
filter(store == 13) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_13)
`geom_smooth()` using formula = 'y ~ x'
# A plot to demonstrate the fluctuation of CPI by region/store. Note that the
# smoothed line is negative in some locales and positive in others.
animated_plot <-
dfw %>%
filter(store %in% c(11:15)) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal() +
gganimate::transition_states(store, transition_length = 1, state_length = 2) +
gganimate::view_follow()
animated_plot
What we observe here is that the relationship between CPI and weekly
sales can vary greatly by store/region. This is consistent with our
evaluation of fit_cpi: that model explained only a small share (~0.5%)
of the variance in Weekly_Sales, so we would expect to see these kinds
of swings. With a (much) higher Adjusted R-squared, these variations
would look unusual.
dfw %>%
group_by(store) %>%
group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
filter(term == "cpi")
# Filtering for 2012 and plotting CPI against Weekly_Sales
plot <- dfw %>%
filter(lubridate::year(date) == 2012) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
plotly::ggplotly(plot)
We see an interesting effect when we filter for one specific year.
The clusters are nearly vertical because CPI is calculated
geographically, with either Core Based Statistical Area (CBSA) or
Metropolitan Statistical Area (MSA). CPI might be the same in a
particular region, but different stores in that region will have
different sales volume, hence the vertical clusters.
plot_store_cpi <- dfw %>%
filter(store == 10, lubridate::year(date) == 2012) %>%
ggplot(aes(x = cpi, y = weekly_sales)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
theme_minimal()
plotly::ggplotly(plot_store_cpi)
Although CPI varies by region, the deviation in CPI across time for
a single region tends to be much lower, which is why we see such a slim
range here. Since CPI is a measure of inflation, we expect to see these
regional effects.
# A new iteration of the previous model that also includes store Size as an
# independent variable
options(scipen = 999)
fit_cpi_size <-
linear_model %>%
fit(weekly_sales ~ cpi + size, data = dfw)
summary(fit_cpi_size$fit)
Call:
stats::lm(formula = weekly_sales ~ cpi + size, data = data)
Residuals:
Min 1Q Median 3Q Max
-563750 -167145 -29612 112172 1912650
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 182831.50332 14966.91316 12.216 <0.0000000000000002
cpi -657.04633 76.92121 -8.542 <0.0000000000000002
size 4.84669 0.04796 101.048 <0.0000000000000002
(Intercept) ***
cpi ***
size ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 242800 on 6432 degrees of freedom
Multiple R-squared: 0.6156, Adjusted R-squared: 0.6155
F-statistic: 5151 on 2 and 6432 DF, p-value: < 0.00000000000000022
# Comparing fit_cpi to fit_cpi_size to see which is better at explaining the variance
# in Weekly_Sales
anova(fit_cpi$fit, fit_cpi_size$fit)
Analysis of Variance Table
Model 1: Weekly_Sales ~ CPI
Model 2: Weekly_Sales ~ CPI + Size
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6433 9.8128e+14
2 6432 3.7924e+14 1 6.0204e+14 10211 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note also that the magnitude of the CPI coefficient has been reduced
from ~$733 to ~$657 in the revised model. Because size is correlated
with CPI, some of the variation in sales that the CPI-only model
attributed to CPI is now attributed to size.
# Building a model that uses all variables EXCEPT Date and Store
fit_full <-
linear_model %>%
fit(weekly_sales ~ . - store - date, data = dfw)
summary(fit_full$fit)
Call:
stats::lm(formula = weekly_sales ~ . - store - date, data = data)
Residuals:
Min 1Q Median 3Q Max
-557148 -165608 -24125 112851 1918479
Coefficients:
Estimate Std. Error t value
(Intercept) 313268.64179 35462.52571 8.834
isholidayTRUE 60120.82723 11961.41705 5.026
temperature 1001.63556 173.85005 5.761
fuel_price -13332.38798 6821.86906 -1.954
cpi -946.07156 84.45115 -11.203
unemployment -12517.11110 1724.67790 -7.258
size 4.83971 0.04802 100.786
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
isholidayTRUE 0.00000051374714 ***
temperature 0.00000000872334 ***
fuel_price 0.0507 .
cpi < 0.0000000000000002 ***
unemployment 0.00000000000044 ***
size < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241200 on 6428 degrees of freedom
Multiple R-squared: 0.621, Adjusted R-squared: 0.6206
F-statistic: 1755 on 6 and 6428 DF, p-value: < 0.00000000000000022
anova(fit_cpi_size$fit, fit_full$fit)
Analysis of Variance Table
Model 1: Weekly_Sales ~ CPI + Size
Model 2: Weekly_Sales ~ (Store + Date + IsHoliday + Temperature + Fuel_Price +
CPI + Unemployment + Size) - Store - Date
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6432 3.7924e+14
2 6428 3.7394e+14 4 5.3028e+12 22.789 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We observe a further, though slight, improvement in the Adjusted
R-squared value (from 0.6155 to 0.6206) in the new model, which uses
every predictor except the store and date identifiers (fit_full). The
ANOVA test also confirms that the improvement in explanatory power is
statistically significant.
More Linear Regression
We hypothesize that the effect of good weather is increased on
holidays. We can test this by revising fit_full and including an
interaction term.
fit_full_int <-
linear_model %>%
fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)
summary(fit_full_int$fit)
Call:
stats::lm(formula = weekly_sales ~ . - store - date + isholiday *
temperature, data = data)
Residuals:
Min 1Q Median 3Q Max
-557499 -165415 -24493 112914 1918376
Coefficients:
Estimate Std. Error t value
(Intercept) 314781.71540 35650.02554 8.830
isholidayTRUE 47453.37499 32654.55081 1.453
temperature 980.88764 180.84372 5.424
fuel_price -13421.90341 6825.68548 -1.966
cpi -945.95168 84.45707 -11.200
unemployment -12511.43706 1724.84245 -7.254
size 4.83969 0.04802 100.779
isholidayTRUE:temperature 247.28642 593.15059 0.417
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
isholidayTRUE 0.1462
temperature 0.000000060421363 ***
fuel_price 0.0493 *
cpi < 0.0000000000000002 ***
unemployment 0.000000000000453 ***
size < 0.0000000000000002 ***
isholidayTRUE:temperature 0.6768
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241200 on 6427 degrees of freedom
Multiple R-squared: 0.621, Adjusted R-squared: 0.6206
F-statistic: 1504 on 7 and 6427 DF, p-value: < 0.00000000000000022
anova(fit_full$fit, fit_full_int$fit)
Analysis of Variance Table
Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price +
cpi + unemployment + size) - store - date
Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price +
cpi + unemployment + size) - store - date + isholiday * temperature
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6428 373938635272534
2 6427 373928522950003 1 10112322531 0.1738 0.6768
Although the interaction coefficient in fit_full_int is positive
(~$247 per degree), it is not statistically significant (p = 0.68), and
the ANOVA test shows no statistically significant improvement over
fit_full. We therefore cannot conclude that the effect of good weather
is stronger on holidays, or that the model with the interaction term is
an improvement.
We’ll also test whether the effect of temperature on weekly sales is
linear by squaring that variable.
fit_full_sq <-
linear_model %>%
fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)
summary(fit_full_sq$fit)
Call:
stats::lm(formula = weekly_sales ~ . - store - date + I(temperature^2),
data = data)
Residuals:
Min 1Q Median 3Q Max
-561455 -165260 -24674 112058 1911166
Coefficients:
Estimate Std. Error t value
(Intercept) 261043.07789 41108.23517 6.350
isholidayTRUE 62296.69695 11987.90765 5.197
temperature 3293.89228 930.05089 3.542
fuel_price -14713.82683 6841.25650 -2.151
cpi -954.71915 84.48673 -11.300
unemployment -12529.36093 1723.97501 -7.268
size 4.83146 0.04811 100.420
I(temperature^2) -19.82165 7.90072 -2.509
Pr(>|t|)
(Intercept) 0.000000000229814 ***
isholidayTRUE 0.000000209190349 ***
temperature 0.0004 ***
fuel_price 0.0315 *
cpi < 0.0000000000000002 ***
unemployment 0.000000000000409 ***
size < 0.0000000000000002 ***
I(temperature^2) 0.0121 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241100 on 6427 degrees of freedom
Multiple R-squared: 0.6214, Adjusted R-squared: 0.621
F-statistic: 1507 on 7 and 6427 DF, p-value: < 0.00000000000000022
anova(fit_full$fit, fit_full_sq$fit)
Analysis of Variance Table
Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price +
cpi + unemployment + size) - store - date
Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price +
cpi + unemployment + size) - store - date + I(temperature^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 6428 373938635272534
2 6427 373572776702821 1 365858569713 6.2943 0.01214 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Plotting the relationship between Temperature^2 and Weekly Sales
dfw %>%
ggplot(aes(x = temperature, y = weekly_sales)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2))

The model output suggests a curvilinear, inverted U-shaped
relationship (visualized above): the coefficient on temperature is
positive and the coefficient on squared temperature is negative. People
are less likely to shop retail on a freezing cold day. Increasing
temperatures are associated with increased sales, but only to a point.
As temperatures become excessive and dangerous, sales start to decrease.
If we were managing Walmart’s promotions, we could offer larger
discounts when the weather is at either extreme and perhaps even
increase the price of certain products when the temperature is
mild.
Predictive Analytics
Now that we have a model that is fairly robust we will use it to
make predictions of weekly sales revenue.
# Setting seed for reproducibility
set.seed(3.14159)
# Splitting the data set into a training dataset (75%) and a test dataset (25%)
dfw_split <- initial_split(dfw)
dfw_train <- training(dfw_split)
dfw_test <- testing(dfw_split)
# Fitting the model
fit_org <-
linear_model %>%
fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)
summary(fit_org$fit)
Call:
stats::lm(formula = weekly_sales ~ . - date - store + I(temperature^2),
data = data)
Residuals:
Min 1Q Median 3Q Max
-557260 -165114 -25112 115048 1913671
Coefficients:
Estimate Std. Error t value
(Intercept) 254646.56714 47252.27399 5.389
isholidayTRUE 60380.17646 13967.26022 4.323
temperature 3056.33287 1068.34983 2.861
fuel_price -19393.30976 7819.05402 -2.480
cpi -921.70456 96.40208 -9.561
unemployment -10579.94848 1991.83421 -5.312
size 4.82603 0.05496 87.809
I(temperature^2) -16.27913 9.05811 -1.797
Pr(>|t|)
(Intercept) 0.0000000742 ***
isholidayTRUE 0.0000157042 ***
temperature 0.00424 **
fuel_price 0.01316 *
cpi < 0.0000000000000002 ***
unemployment 0.0000001135 ***
size < 0.0000000000000002 ***
I(temperature^2) 0.07237 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 239000 on 4818 degrees of freedom
Multiple R-squared: 0.6248, Adjusted R-squared: 0.6242
F-statistic: 1146 on 7 and 4818 DF, p-value: < 0.00000000000000022
# The linear regression output as a tibble
tidy(fit_org)
# Creating a new dataframe with predicted values
results_org <-
predict(fit_org, new_data = dfw_test) %>%
bind_cols(dfw_test) %>%
rename(Predicted_Sales = .pred)
results_org %>%
arrange(date)
# Defining the metric set we will be working with to evaluate the models
perf_metrics <- metric_set(rmse, mae)
# Calculating the performance of fit
perf_metrics(results_org, truth = weekly_sales, estimate = Predicted_Sales)
When we remove the squared temperature term, something notable occurs.
First, our Adjusted R-squared value diminishes slightly (from 62.2% to
62.1%), making it a slightly less appealing model in terms of explaining
the variance in weekly sales. However, we also observe that the
prediction error on the test set has been reduced, making fit_org_nosq
relatively superior in terms of predictive capability. Since we are
trying to build a reliable predictive model, we exclude the squared term
and conclude that fit_org_nosq is better for that purpose.
More Predictive Modeling
We are fairly pleased with both the explanatory and predictive power
of fit_org_nosq, but of course we would like to improve upon both.
One issue that we have not yet discussed is the variability in weekly
sales across Walmart locations, as shown below.
# Calculating total weekly sales per store
sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)
# A bar chart showing the distribution of weekly sales revenue by store
ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
geom_bar(stat = "identity", fill = "#0078D4") +
labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

This final model uses a log-linear regression to explain weekly
Walmart sales using store-level, economic, and seasonal predictors.
Applying a log transformation to weekly sales substantially improves
model performance, yielding an adjusted R² of 0.71, and stabilizes
variance across stores with vastly different revenue scales. Overall
model fit is strong, with well-behaved residuals and a highly
significant F-statistic, indicating that the included predictors jointly
explain a meaningful share of sales variation.
Results show that store size is the dominant driver of weekly sales,
dwarfing macroeconomic effects and confirming that physical scale
largely determines revenue potential. Holiday weeks are associated with
an average 6–7% increase in sales, validating the importance of seasonal
demand spikes. Inflation, proxied by CPI, has a small but statistically
significant negative relationship with sales, even after controlling for
store characteristics. Temperature and unemployment exhibit modest
effects consistent with economic intuition, while fuel prices do not
appear to meaningfully impact sales once other factors are accounted
for.
This specification represents the best balance between
interpretability, explanatory power, and robustness among the models
tested. The log transformation enables clear percentage-based
interpretations while materially improving fit relative to linear
alternatives, making the model suitable for both analytical insight and
downstream forecasting.
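As a rough illustration of the percentage-based reading described above, the sketch below converts a coefficient from a log-linear model into a percent change in sales. The coefficient value 0.065 is hypothetical, chosen only so that it falls in the 6 to 7% range mentioned for holiday weeks; it is not taken from the fitted model.

```r
# Hypothetical coefficient on a holiday indicator in a model of log(weekly_sales).
# This value is an assumption for illustration, not output from the analysis.
beta_holiday <- 0.065

# In a log-linear model, a one-unit change in a predictor multiplies the
# outcome by exp(beta), i.e. changes it by 100 * (exp(beta) - 1) percent.
pct_change <- 100 * (exp(beta_holiday) - 1)
round(pct_change, 2)
```

For small coefficients the percent change is close to 100 * beta, which is why log-scale coefficients are often read directly as approximate percentages.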
Limitations
- The model does not explicitly account for store-level fixed
effects or regional hierarchies, which may mask persistent
location-specific dynamics.
- Temporal structure is handled implicitly; autocorrelation and
seasonality are not directly modeled.
- The analysis assumes linear relationships on the log scale and may
understate nonlinear or interaction effects beyond those tested.
- CPI and unemployment are measured at broader geographic levels and
may not fully capture local economic conditions.
Next Steps
- Implement mixed-effects (hierarchical) models to capture
store-specific variation.
- Explore time-series approaches for improved short-term
forecasting.
---
title: "Walmart's Weekly Sales Linear Regression"
output:
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
  html_notebook: default
always_allow_html: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

***

```{r}
library("tidyverse")
library("tidymodels")
library("tidylog")

```


```{r}

# Reading in the Walmart dataset:

dfw <- read_csv("data/walmart.csv") %>% 
  rename_with(tolower) %>% 
  arrange(store)


head(dfw)

```

```{r}

#The structure of the data and each column's data type.  Note that isholiday is a logical (i.e., true/false) predictor variable

str(dfw)

```


<h1 style="text-align:center;">Simple Linear Regression Model</h1>

#### We will begin by running a simple linear model that regresses weekly sales onto Consumer Price Index (CPI)


```{r}
# Specifying our model type and setting the computational engine

linear_model <- 
  linear_reg() %>% 
  set_engine("lm")

# Fitting the model

fit_cpi <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi, data = dfw)
  
# Model output 

summary(fit_cpi$fit)

```


##### In this model, the intercept implies expected weekly sales of **~$827,280** for a Walmart store facing a theoretical CPI of 0, which is an extrapolation well outside the observed data.  We also observe that the relationship between Weekly_Sales and CPI is negative.  That is, a one-unit increase in CPI is associated with a decrease of ~$733 in weekly sales, and a one-unit decrease in CPI with an increase of ~$733.


##### In evaluating the model statistics, we see an Adjusted R-squared value of 0.005269. In other words, this model explains only roughly 0.5% of the variance in Walmart's weekly sales.  So, while the estimated *effect* of CPI on Weekly_Sales is statistically significant, we must conclude that this model fails to explain the variance in our target variable.
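
##### As a minimal sketch of where these quantities come from, the block below fits the same kind of one-predictor model to simulated data and extracts the intercept, the slope, and the Adjusted R-squared. The data frame `sim` and its parameters are made up for illustration and are not the Walmart data.

```r
# Simulated data (hypothetical; not the Walmart dataset)
set.seed(1)
sim <- data.frame(cpi = runif(200, min = 120, max = 230))
sim$weekly_sales <- 800000 - 700 * sim$cpi + rnorm(200, sd = 300000)

sim_fit <- lm(weekly_sales ~ cpi, data = sim)

# Intercept: predicted sales when cpi = 0 (an extrapolation)
# Slope: expected change in sales per one-unit increase in cpi
sim_coefs <- coef(sim_fit)
sim_coefs

# Share of variance explained, penalized for the number of predictors
adj_r2 <- summary(sim_fit)$adj.r.squared
adj_r2
```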


```{r}
# Now we will plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.

plot_store_10 <- 
  dfw %>% 
  filter(store == 10) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_10)


plot_store_11 <- 
  dfw %>% 
  filter(store == 11) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_11)

plot_store_12 <- 
  dfw %>% 
  filter(store == 12) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_12)

plot_store_13 <- 
  dfw %>% 
  filter(store == 13) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_13)

```

```{r}

# A plot to demonstrate the fluctuation of CPI by region/store.  Note that the
# smoothed line is negative in some locales and positive in others.

animated_plot <- 
    dfw %>% 
    filter(store %in% c(11:15)) %>% 
    ggplot(aes(x = cpi, y = weekly_sales)) + 
    geom_point() + 
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
    theme_minimal() + 
    gganimate::transition_states(store, transition_length = 1, state_length = 2) +
    gganimate::view_follow()

animated_plot
```


##### What we observe here is that the relationship between CPI and weekly sales can vary greatly by store/region.  This is consistent with our evaluation of fit_cpi: that model explained only a small share (~0.5%) of the variance in Weekly_Sales, so we would expect to see these kinds of swings.  With a (much) higher Adjusted R-squared, these variations would look unusual.


```{r}

dfw %>%
  group_by(store) %>%
  group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
  filter(term == "cpi")

```


```{r}
# Filtering for 2012 and plotting CPI against Weekly_Sales

plot <- dfw %>%
  filter(lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
    
plotly::ggplotly(plot)

```


##### We see an interesting effect when we filter for one specific year. The clusters are nearly vertical because CPI is calculated geographically, with either Core Based Statistical Area (CBSA) or Metropolitan Statistical Area (MSA).  CPI might be the same in a particular region, but different stores in that region will have different sales volume, hence the vertical clusters.


```{r}
# Now let's look exclusively at store 10

plot_store_cpi <- dfw %>%
  filter(store == 10, lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_cpi)
```


##### Although CPI varies by region, the deviation in CPI across time for a single region tends to be much lower, which is why we see such a slim range here. Since CPI is a measure of inflation, we expect to see these regional effects.


```{r}

# A new iteration of the previous model that also includes store Size as an independent variable

fit_cpi_size <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi + size, data = dfw)

summary(fit_cpi_size$fit)

```

```{r}
# Comparing fit_cpi to fit_cpi_size to see which is better at explaining the variance in Weekly_Sales

anova(fit_cpi$fit, fit_cpi_size$fit)
```


##### The model that includes size as a predictor variable (fit_cpi_size) performs significantly better than fit_cpi.  Its Adjusted R-squared indicates that it explains ~62% of the variance in weekly sales, and the ANOVA test confirms that adding size yields a statistically significant improvement.
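
##### The F statistic behind this comparison can be reproduced by hand from the residual sums of squares. The sketch below uses the rounded values printed in the ANOVA table for these two models, so it matches the reported F of 10211 only up to rounding; the variable names are ours.

```r
# Values copied from the ANOVA table for fit_cpi vs. fit_cpi_size
rss_small <- 9.8128e14   # residual sum of squares, weekly_sales ~ cpi
rss_large <- 3.7924e14   # residual sum of squares, weekly_sales ~ cpi + size
df_added  <- 1           # one extra coefficient (size)
df_resid  <- 6432        # residual degrees of freedom of the larger model

# F = (drop in RSS per added coefficient) / (residual variance of larger model)
f_stat <- ((rss_small - rss_large) / df_added) / (rss_large / df_resid)
f_stat

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_added, df_resid, lower.tail = FALSE)
p_value
```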


```{r}
tidy(fit_cpi$fit)
tidy(fit_cpi_size$fit)
```


##### Note also that the magnitude of the CPI coefficient has been reduced from ~$733 to ~$657 in the revised model.  Because size is correlated with CPI, some of the variation in sales that the CPI-only model attributed to CPI is now attributed to size.
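
##### To see why a coefficient moves when a correlated predictor is added, the following sketch simulates two correlated predictors and fits the model with and without one of them. All names and parameter values here are hypothetical and chosen for illustration; they are not estimates from the Walmart data.

```r
# Simulated data (hypothetical; not the Walmart dataset)
set.seed(42)
n <- 500
size  <- rnorm(n)
cpi   <- 0.5 * size + rnorm(n)            # cpi is correlated with size by construction
sales <- 2 * size - 1 * cpi + rnorm(n)    # true cpi effect is -1

fit_short <- lm(sales ~ cpi)          # omits size
fit_long  <- lm(sales ~ cpi + size)   # includes size

# The cpi coefficient shifts once the correlated predictor is added
coef_short <- coef(fit_short)[["cpi"]]
coef_long  <- coef(fit_long)[["cpi"]]
c(short = coef_short, long = coef_long)
```

##### In this simulation the short model's CPI coefficient is pulled toward zero because cpi partly stands in for size. In real data, the direction and size of the shift depend on how the predictors are correlated with each other and with the outcome.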


```{r}
# Building a model that uses all variables EXCEPT Date and Store

fit_full <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw)

summary(fit_full$fit)

```
```{r}
anova(fit_cpi_size$fit, fit_full$fit)
```


##### We observe a further, though slight, improvement in the Adjusted R-squared value (from 0.6155 to 0.6206) in the new model, which uses every predictor except the store and date identifiers (fit_full).  The ANOVA test also confirms that the improvement in explanatory power is statistically significant.

<h1 style="text-align:center;">More Linear Regression</h1>

##### We hypothesize that the effect of good weather is increased on holidays.  We can test this by revising fit_full and including an interaction term.


```{r}
fit_full_int <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)

summary(fit_full_int$fit)
```
```{r}
anova(fit_full$fit, fit_full_int$fit)
```


##### Although the interaction coefficient in fit_full_int is positive (~$247 per degree), it is not statistically significant (p = 0.68), and the ANOVA test shows no statistically significant improvement over fit_full.  We therefore cannot conclude that the effect of good weather is stronger on holidays, or that the model with the interaction term is an improvement.
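
##### As a small worked example of how to read the interaction term, the sketch below combines the temperature and interaction coefficients reported by `summary(fit_full_int$fit)` into separate temperature slopes for holiday and non-holiday weeks. The variable names are ours; the numbers are copied from that output.

```r
# Coefficients copied from the fit_full_int summary output
b_temp      <- 980.88764    # temperature (slope in non-holiday weeks)
b_interact  <- 247.28642    # isholidayTRUE:temperature
se_interact <- 593.15059    # standard error of the interaction term

slope_non_holiday <- b_temp
slope_holiday     <- b_temp + b_interact   # slope in holiday weeks

# The t value is the estimate divided by its standard error
t_interact <- b_interact / se_interact

c(non_holiday = slope_non_holiday, holiday = slope_holiday, t = t_interact)
```

##### The holiday slope is larger in the point estimate, but a t value this close to zero means the difference between the two slopes is well within sampling noise.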

##### We'll also test whether the effect of temperature on weekly sales is linear by squaring that variable.


```{r}
fit_full_sq <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)

summary(fit_full_sq$fit)
```
```{r}
anova(fit_full$fit, fit_full_sq$fit)
```

```{r}
## Plotting the relationship between Temperature^2 and Weekly Sales 

dfw %>%
  ggplot(aes(x = temperature, y = weekly_sales)) + 
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))

```


##### The model output suggests a curvilinear, inverted U-shaped relationship (visualized above): the coefficient on temperature is positive and the coefficient on squared temperature is negative.  People are less likely to shop retail on a freezing cold day. Increasing temperatures are associated with increased sales, but only to a point.  As temperatures become excessive and dangerous, sales start to decrease.
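
##### The temperature at which predicted sales peak follows from the two temperature coefficients. The sketch below uses the estimates reported by `summary(fit_full_sq$fit)`; the variable names are ours, and the result is in whatever units the temperature column uses.

```r
# Coefficients copied from the fit_full_sq summary output
b_temp    <- 3293.89228   # temperature
b_temp_sq <- -19.82165    # I(temperature^2)

# For y = a + b1 * x + b2 * x^2 with b2 < 0, the maximum is at x = -b1 / (2 * b2)
turning_point <- -b_temp / (2 * b_temp_sq)
round(turning_point, 1)
```

##### This is a point estimate only, holding the other predictors fixed; given the standard error on the squared term, the location of the peak is itself quite uncertain.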

##### If we were managing Walmart's promotions, we could offer larger discounts when the weather is at either extreme and perhaps even increase the price of certain products when the temperature is mild.

<h1 style="text-align:center;">Predictive Analytics</h1>

##### Now that we have a model that is fairly robust we will use it to make predictions of weekly sales revenue.


```{r}
# Setting seed for reproducibility

set.seed(3.14159)

# Splitting the data set into a training dataset (75%) and a test dataset (25%)

dfw_split <- initial_split(dfw)

dfw_train <-  training(dfw_split)

dfw_test <- testing(dfw_split)
```

```{r}
# Fitting the model

fit_org <- 
  linear_model %>% 
  fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)

summary(fit_org$fit)
  
```

```{r}
# The linear regression output as a tibble

tidy(fit_org)
```
```{r}

# Creating a new dataframe with predicted values

results_org <-
  predict(fit_org, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

results_org %>% 
  arrange(date)

```

```{r}
# Defining the metric set we will be working with to evaluate the models

perf_metrics <- metric_set(rmse, mae)
```


```{r}

# Calculating the performance of fit

perf_metrics(results_org, truth =  weekly_sales, estimate = Predicted_Sales)
```


##### These metrics indicate that our model's predictions are off by ~$240,424 according to RMSE and ~$179,092 according to MAE. These numbers appear alarming until one recalls that weekly sales by Walmart store location range from about $70k to $2.8 million, with a mean of $740k and a median of $689k.
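
##### To make the two metrics concrete, the sketch below computes RMSE and MAE by hand on four made-up sales values, then expresses the RMSE quoted above as a share of the quoted mean. The toy vectors are hypothetical; only the last line uses figures from the text.

```{r}
# Toy actual and predicted weekly sales (hypothetical, in USD)
actual <- c(700000, 1200000, 450000, 980000)
predicted <- c(650000, 1000000, 500000, 1100000)
errors <- actual - predicted

# RMSE squares the errors before averaging, so large misses weigh more
rmse_manual <- sqrt(mean(errors^2))

# MAE averages the absolute errors, so every dollar of error counts equally
mae_manual <- mean(abs(errors))

c(rmse = rmse_manual, mae = mae_manual)

# RMSE relative to mean weekly sales, using the figures quoted above
relative_rmse <- 240424 / 740000
relative_rmse
```

##### RMSE can never be smaller than MAE, and a wide gap between the two suggests that a few large misses dominate the error.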


```{r}
#Building the model without I(Temperature^2) variable using only the training data set

fit_org_nosq <-
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw_train)

```

```{r}
summary(fit_org_nosq$fit)
```


```{r}
# Comparing the models (smaller model first, as in the earlier ANOVA tests)

anova(fit_org_nosq$fit, fit_org$fit)
```


```{r}
#Creating a new dataframe results_org_nosq with predictions

results_org_nosq <-
  predict(fit_org_nosq, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

#Calculating performance metrics

perf_metrics(results_org_nosq, truth =  weekly_sales, estimate = Predicted_Sales)
```


##### When we remove the squared temperature term, something notable occurs. First, our Adjusted R-Squared value diminishes slightly (from 62.2% to 62.1%), making the model slightly less appealing in terms of explaining the variance in weekly sales. However, we also observe that the prediction error on the test set has been reduced, making fit_org_nosq relatively superior in terms of predictive capability. Since we are trying to build a reliable *predictive* model, we exclude the term and conclude that fit_org_nosq is better for that purpose.
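
##### A difference in error this small on a single train/test split could partly reflect which rows landed in the test set. One way to check is k-fold cross-validation, sketched below in base R on simulated data. The data frame `sim`, the helper `cv_rmse()`, and the choice of five folds are all assumptions made for illustration and are not part of the analysis above.

```{r}
# Hypothetical simulated data with a purely linear temperature effect
set.seed(2)
n <- 400
sim <- data.frame(temperature = runif(n, 0, 100), size = runif(n, 30, 200))
sim$weekly_sales <- 200000 + 4000 * sim$size + 500 * sim$temperature +
  rnorm(n, sd = 100000)

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: average out-of-fold RMSE for a given formula
cv_rmse <- function(formula, data, fold) {
  fold_rmse <- sapply(sort(unique(fold)), function(i) {
    fit <- lm(formula, data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    sqrt(mean((data$weekly_sales[fold == i] - pred)^2))
  })
  mean(fold_rmse)
}

rmse_linear <- cv_rmse(weekly_sales ~ size + temperature, sim, fold)
rmse_squared <- cv_rmse(weekly_sales ~ size + temperature + I(temperature^2),
                        sim, fold)
c(linear = rmse_linear, squared = rmse_squared)
```

##### If the ranking of the two models holds across folds, the conclusion drawn from the single split is on firmer ground.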

<h1 style="text-align:center;">More Predictive Modeling</h1>

##### We are fairly pleased with both the explanatory and predictive power of fit_org_nosq, but of course we would like to improve on both. One issue that we have not yet discussed is the variability in weekly sales across Walmart locations, as shown below.


```{r}
# Calculating total weekly sales per store

sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)

# A bar chart showing the distribution of weekly sales revenue by store

ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
  geom_bar(stat = "identity", fill = "#0078D4") +
  labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

```


##### The bar chart shows that sales volume differs widely from store to store, which suggests that rescaling the weekly_sales variable could improve our model. One way to do this is to replace each value with its natural logarithm. We'll use the log() function to accomplish this.


```{r}
# Adding a log-transformed sales column; all other variables are unchanged
dfw_log <-
  dfw %>% 
  mutate(log_sales = log(weekly_sales))

dfw_log
```
```{r}
set.seed(3.14159)

dfwlog_split <- initial_split(dfw_log)
dfwlog_train <- training(dfwlog_split)
dfwlog_test <- testing(dfwlog_split)
```

```{r}
fit_log <- 
  linear_model %>% 
  fit(log_sales ~ . - store - date - weekly_sales, data=dfwlog_train)

summary(fit_log$fit)
```
##### This final model uses a log-linear regression to explain weekly Walmart sales using store-level, economic, and seasonal predictors. Applying a log transformation to weekly sales substantially improves model performance, yielding an adjusted R² of 0.71, and stabilizes variance across stores with vastly different revenue scales. Overall model fit is strong, with well-behaved residuals and a highly significant F-statistic, indicating that the included predictors jointly explain a meaningful share of sales variation.
##### Results show that store size is the dominant driver of weekly sales, dwarfing macroeconomic effects and confirming that physical scale largely determines revenue potential. Holiday weeks are associated with an average 6–7% increase in sales, validating the importance of seasonal demand spikes. Inflation, proxied by CPI, has a small but statistically significant negative relationship with sales, even after controlling for store characteristics. Temperature and unemployment exhibit modest effects consistent with economic intuition, while fuel prices do not appear to meaningfully impact sales once other factors are accounted for.
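
##### Percentage readings like the holiday effect above come from exponentiating a log-scale coefficient: a coefficient b on a dummy variable corresponds to a 100 * (exp(b) - 1) percent change in sales. The sketch below shows the conversion. The helper name `pct_change()` and the coefficient 0.065 are hypothetical and are not taken from the fit_log output.

```{r}
# Hypothetical helper: convert a log-scale coefficient to a percent change
pct_change <- function(beta) 100 * (exp(beta) - 1)

# An illustrative coefficient of 0.065 implies a bit under a 7% increase
pct_change(0.065)

# For small coefficients, 100 * beta is a close approximation
pct_change(0.01)
```

##### For continuous predictors the result reads as the percent change in sales per one-unit increase in the predictor.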
##### This specification represents the best balance between interpretability, explanatory power, and robustness among the models tested. The log transformation enables clear percentage-based interpretations while materially improving fit relative to linear alternatives, making the model suitable for both analytical insight and downstream forecasting.
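
##### Before using a log model for forecasting, its predictions need to be converted back to dollars so that RMSE and MAE are on the same scale as the earlier models. The sketch below does this on simulated data. The objects `sim`, `train`, `test`, and `fit_log_sim` are hypothetical, and the plain exp() back-transformation is one simple choice that tends to predict the median rather than the mean.

```{r}
# Hypothetical simulated data: sales are log-linear in store size
set.seed(3)
n <- 300
sim <- data.frame(size = runif(n, 30, 200))
sim$weekly_sales <- exp(12 + 0.008 * sim$size + rnorm(n, sd = 0.3))

# Simple 75/25 split by row position (rows are already in random order)
train <- sim[1:225, ]
test <- sim[226:300, ]

# Fit on the log scale, then exponentiate predictions to return to dollars
fit_log_sim <- lm(log(weekly_sales) ~ size, data = train)
pred_dollars <- exp(predict(fit_log_sim, newdata = test))

# Error metrics in dollars, comparable with the untransformed models
rmse_dollars <- sqrt(mean((test$weekly_sales - pred_dollars)^2))
mae_dollars <- mean(abs(test$weekly_sales - pred_dollars))
c(rmse = rmse_dollars, mae = mae_dollars)
```

##### The same steps could be applied to fit_log and dfwlog_test to compare its dollar-scale error with that of fit_org_nosq.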

<h1 style="text-align:center;">Limitations</h1>

##### - The model does not explicitly account for store-level fixed effects or regional hierarchies, which may mask persistent location-specific dynamics.

##### - Temporal structure is handled implicitly; autocorrelation and seasonality are not directly modeled.

##### - The analysis assumes linear relationships on the log scale and may understate nonlinear or interaction effects beyond those tested.

##### - CPI and unemployment are measured at broader geographic levels and may not fully capture local economic conditions.
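
##### To illustrate the first limitation, the sketch below compares a pooled regression with one that adds store fixed effects through factor(store). The data are simulated, and the store effects, the CPI coefficient, and all object names are hypothetical.

```{r}
# Hypothetical simulated panel: 10 stores observed for 50 weeks each
set.seed(4)
n_stores <- 10
n_weeks <- 50
sim <- data.frame(
  store = rep(seq_len(n_stores), each = n_weeks),
  cpi = runif(n_stores * n_weeks, 120, 230)
)

# Each store gets its own persistent shift in log sales
store_effect <- rnorm(n_stores, mean = 0, sd = 0.5)
sim$log_sales <- 13 + store_effect[sim$store] - 0.001 * sim$cpi +
  rnorm(nrow(sim), sd = 0.1)

# Pooled model ignores the store; fixed-effects model adds one dummy per store
fit_pooled <- lm(log_sales ~ cpi, data = sim)
fit_fe <- lm(log_sales ~ cpi + factor(store), data = sim)

c(pooled = summary(fit_pooled)$adj.r.squared,
  fixed_effects = summary(fit_fe)$adj.r.squared)
```

##### In this simulation most of the variation sits between stores, so the fixed-effects fit explains far more of it. Whether the same holds for dfw would need to be checked on the real data.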

<h1 style="text-align:center;">Next Steps</h1>

##### - Implement mixed-effects (hierarchical) models to capture store-specific variation.

##### - Explore time-series approaches for improved short-term forecasting.

##### - Incorporate promotional data, foot traffic, or local demographic variables to enhance predictive accuracy.
